This is the code used for my exploration of NBC’s 2016 Twitter Troll Dataset. It relies heavily on packages from the tidyverse such as dplyr, purrr, stringr, tidyr, and ggplot2. If you haven’t used many tidyverse packages, a good resource for learning them is Grolemund and Wickham’s R for Data Science. For a particularly good introduction to purrr, a package for iterating over lists and vectors, read Bryan’s purrr tutorial. Note that I do not focus on graph visualization in this tutorial. If you want to learn more about graph/network visualization in R and igraph, please check out Ognyanova’s AMAZING graph visualization tutorials.
Because the code here depends on so many different packages, I explicitly prefix each function with the package it comes from, in the format package::function. While this is more verbose and adds clutter to the tutorial, it makes clear which functions you are using and where they come from.
Finally, the only package I will load is magrittr. Its %>% pipe makes it easy to order functions logically; essentially, it allows users to “unnest” functions. For example, the following expression is difficult to read because the logic flows from the innermost function outwards.
unlist(strsplit(toupper('hello world'), ' '))
## [1] "HELLO" "WORLD"
However, written with the %>% pipe, we can reason about the code piecemeal.
library(magrittr)
'hello world' %>%
toupper() %>%
strsplit(' ') %>%
unlist()
## [1] "HELLO" "WORLD"
If the above code confuses you, think of it this way: the output of the first function is the first input of the second function, the output of the second function is the first input of the third function, and so on…
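A minimal sketch of that equivalence, using only base functions:

```r
library(magrittr)

# The pipe passes its left-hand side as the first argument of the
# right-hand function, so these two expressions are equivalent:
piped  <- 'hello world' %>% toupper()
nested <- toupper('hello world')
identical(piped, nested)
## [1] TRUE
```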
To proceed with this tutorial you should have the following installed:
install.packages(c(
  'dplyr',
  'ggplot2',
  'igraph',
  'jsonlite',
  'knitr',
  'magrittr',
  'purrr',
  'readr',
  'scales',
  'stm',
  'stringr',
  'tidyr',
  'tidytext'
))
Before anything, let’s load the data and look at what columns we have available.
#load data
tweets <- readr::read_csv('tweets.csv')
names(tweets)
## [1] "user_id" "user_key"
## [3] "created_at" "created_str"
## [5] "retweet_count" "retweeted"
## [7] "favorite_count" "text"
## [9] "tweet_id" "source"
## [11] "hashtags" "expanded_urls"
## [13] "posted" "mentions"
## [15] "retweeted_status_id" "in_reply_to_status_id"
To create a retweeting network, we only need two columns from this data set - user_key and text. We can isolate these two columns with dplyr::select:
retweet_network <- tweets %>%
dplyr::select(user_key, text) %>%
dplyr::mutate(text = stringr::str_replace_all(text, "\\r|\\n", '') ) #clean text of newlines
retweet_network %>%
head() %>%
knitr::kable()
| user_key | text |
|---|---|
| ryanmaxwell_1 | #IslamKills Are you trying to say that there were no terrorist attacks in Europe before refugees were let in? |
| detroitdailynew | Clinton: Trump should’ve apologized more, attacked less https://t.co/eJampkoHFZ |
| cookncooks | RT @ltapoll: Who was/is the best president of the past 25 years? (Vote & Retweet) |
| queenofthewo | RT @jww372: I don’t have to guess your religion! #ChristmasAftermath |
| mrclydepratt | RT @Shareblue: Pence and his lawyers decided which of his official emails the public could seehttps://t.co/HjhPguBK1Y by @alisonrose711 |
| giselleevns | @ModicaGiunta me, too! |
Tweets that are actually retweets begin with RT. Because we only care about tweets that were retweeted, we can use dplyr::filter to keep only instances of retweets. stringr::str_detect returns a boolean (TRUE/FALSE) indicating whether the text matches the pattern we are looking for.
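Before applying it to the data set, here is a quick illustration of the pattern on two made-up strings (the `^RT\s` regex requires RT at the start, followed by whitespace):

```r
# Only the first string starts with "RT" followed by whitespace.
examples <- c('RT @someuser: a retweet', 'An original tweet mentioning RT')
stringr::str_detect(examples, '^RT\\s')
## [1]  TRUE FALSE
```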
retweet_network <- retweet_network %>%
dplyr::filter(stringr::str_detect(text, '^RT\\s'))
retweet_network %>%
head() %>%
knitr::kable()
| user_key | text |
|---|---|
| cookncooks | RT @ltapoll: Who was/is the best president of the past 25 years? (Vote & Retweet) |
| queenofthewo | RT @jww372: I don’t have to guess your religion! #ChristmasAftermath |
| mrclydepratt | RT @Shareblue: Pence and his lawyers decided which of his official emails the public could seehttps://t.co/HjhPguBK1Y by @alisonrose711 |
| baobaeham | RT @MDBlanchfield: You’ll never guess who tweeted something false that he saw on TV - The Washington Post https://t.co/K2e4XdXRfu |
| judelambertusa | RT @100PercFEDUP: New post: WATCH: DIAMOND AND SILK Rip On John Kerry Over Israel Comments (VIDEO) https://t.co/NkdKaQ9yYu |
| ameliebaldwin | RT @AriaWilsonGOP: 3 Women Face Charges After Being Caught Stealing Dozens Of Trump Signs https://t.co/JjlZxaW3JN https://t.co/qW2Ok9ROxH |
There is a clear pattern in the text: the character string RT is always followed by the Twitter handle that is being retweeted. We can use stringr::str_extract to pull out the Twitter handles and dplyr::mutate to create a new column with this extracted information.
The code below includes (?<=@)[^:]+(?=:). This is a regular expression (regex): (?<=@) is a lookbehind requiring a preceding @, [^:]+ matches one or more characters up to the next colon, and (?=:) is a lookahead requiring that colon. The other functions are used for cleaning and formatting purposes.
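To see the pattern in isolation, here it is run on a single made-up retweet string:

```r
# Extract the handle between "@" and the following ":".
stringr::str_extract('RT @SomeHandle: an example retweet',
                     '(?<=@)[^:]+(?=:)')
## [1] "SomeHandle"
```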
retweet_network <- retweet_network %>%
dplyr::mutate(retweeted_account = stringr::str_extract(text, '(?<=@)[^:]+(?=:)') %>%
stringr::str_to_lower()) %>% #standardize twitter handles to lower case
dplyr::filter(!is.na(retweeted_account)) %>% #remove text that starts with "RT" but aren't actually retweets
dplyr::select(user_key, retweeted_account, text) %>% #reorder columns
dplyr::distinct()
retweet_network %>%
head() %>%
knitr::kable()
| user_key | retweeted_account | text |
|---|---|---|
| cookncooks | ltapoll | RT @ltapoll: Who was/is the best president of the past 25 years? (Vote & Retweet) |
| queenofthewo | jww372 | RT @jww372: I don’t have to guess your religion! #ChristmasAftermath |
| mrclydepratt | shareblue | RT @Shareblue: Pence and his lawyers decided which of his official emails the public could seehttps://t.co/HjhPguBK1Y by @alisonrose711 |
| baobaeham | mdblanchfield | RT @MDBlanchfield: You’ll never guess who tweeted something false that he saw on TV - The Washington Post https://t.co/K2e4XdXRfu |
| judelambertusa | 100percfedup | RT @100PercFEDUP: New post: WATCH: DIAMOND AND SILK Rip On John Kerry Over Israel Comments (VIDEO) https://t.co/NkdKaQ9yYu |
| ameliebaldwin | ariawilsongop | RT @AriaWilsonGOP: 3 Women Face Charges After Being Caught Stealing Dozens Of Trump Signs https://t.co/JjlZxaW3JN https://t.co/qW2Ok9ROxH |
We don’t necessarily care about the individual tweets. What we really care about is who a troll retweeted and how often. We can use dplyr::count to aggregate the total number of times a particular user_key retweeted another account.
retweet_network_wt <- retweet_network %>%
dplyr::count(user_key, retweeted_account, sort = T)
retweet_network_wt %>%
head() %>%
knitr::kable()
| user_key | retweeted_account | n |
|---|---|---|
| paulinett | blicqer | 1049 |
| melanymelanin | blicqer | 798 |
| paulinett | zaibatsunews | 222 |
| giselleevns | chrixmorgan | 218 |
| brianaregland | feministajones | 200 |
| hyddrox | cmdorsey | 178 |
This data frame has 81862 rows. It represents an edge list, and so we may have too many edges to get a good understanding of our data. We can thin the edges by choosing a cutoff. That is, if we assume that a troll retweeting an account fewer than 5 times is insignificant for our analysis, then we can remove those edges and clean up our graph.
filter_n <- 5
retweet_network_wt <- retweet_network_wt %>%
dplyr::filter(n >= filter_n)
retweet_network_wt %>%
head() %>%
knitr::kable()
| user_key | retweeted_account | n |
|---|---|---|
| paulinett | blicqer | 1049 |
| melanymelanin | blicqer | 798 |
| paulinett | zaibatsunews | 222 |
| giselleevns | chrixmorgan | 218 |
| brianaregland | feministajones | 200 |
| hyddrox | cmdorsey | 178 |
Cool, now we have only 3675 edges in our graph. Let’s now convert the edge list data frame into an actual igraph graph using igraph::graph_from_data_frame.
g_rtwt <- igraph::graph_from_data_frame(retweet_network_wt)
summary(g_rtwt)
## IGRAPH 41eff6f DN-- 1555 3675 --
## + attr: name (v/c), n (e/n)
The D at the top of the summary stands for Directed graph, that is, direction matters for the edges. The N stands for Named graph, that is, each node has a unique name. We can add metadata to the graph directly with $ notation, similar to how we would with a list. Any valid name for a list or data frame is a valid name for a graph attribute. The number of nodes (1555) and the number of edges (3675) follow the graph’s metadata. We are also given a list of vertex attributes (v/…) and edge attributes (e/…), with the suffix indicating the type: c for character, n for numeric, l for logical.
g_rtwt$name <- '2016 Russian Twitter Troll Retweet Network'
g_rtwt$info <- "A graph inspired by NBC's and Neo4j's exploration."
summary(g_rtwt)
## IGRAPH 41eff6f DN-- 1555 3675 -- 2016 Russian Twitter Troll Retweet Networ
## + attr: name (g/c), info (g/c), name (v/c), n (e/n)
The name attribute is a special attribute for a graph and is shown in the summary. We can use $ notation to retrieve other graph attributes.
g_rtwt$info
## [1] "A graph inspired by NBC's and Neo4j's exploration."
If we want to plot a graph, we simply use the plot function. For igraph graphs, plot takes special parameters to manipulate the vertices and edges in the plot. We will not go into depth about plotting graphs, but if you want to learn more, please visit Katya Ognyanova’s detailed tutorials on graph visualization. You can also run ?igraph.plotting to learn more.
set.seed(4321)
plot(
g_rtwt,
vertex.size = 2,
vertex.label = '',
edge.arrow.size = .05,
edge.width = .25,
asp = 0 #aspect ratio
)
igraph can readily calculate many different centrality measures such as igraph::betweenness, igraph::degree, and igraph::eigen_centrality. We will focus on igraph::page_rank. These functions return scores at the node level, and the order of the scores corresponds to the order of the vertices.
pr <- igraph::page_rank(g_rtwt)$vector
head(pr)
## paulinett melanymelanin giselleevns brianaregland hyddrox
## 0.0005891945 0.0005891945 0.0007005381 0.0005891945 0.0005891945
## tpartynews
## 0.0018363397
head(igraph::V(g_rtwt)$name)
## [1] "paulinett" "melanymelanin" "giselleevns" "brianaregland"
## [5] "hyddrox" "tpartynews"
Because these measurements are in the same order as the vertices, we can store them as a vertex attribute.
igraph::V(g_rtwt)$PageRank <- pr
igraph::V(g_rtwt)[[1:6]]
## + 6/1555 vertices, named, from 41eff6f:
## name PageRank
## 1 paulinett 0.0005891945
## 2 melanymelanin 0.0005891945
## 3 giselleevns 0.0007005381
## 4 brianaregland 0.0005891945
## 5 hyddrox 0.0005891945
## 6 tpartynews 0.0018363397
If we want to match the vertex information with outside information, the easiest way is to convert the vertex list into a data frame with igraph::as_data_frame and then join it with the new data using one of dplyr’s join functions (e.g., dplyr::left_join). Let’s combine the vertex list with each troll’s total number of tweets.
vertex_df <- igraph::as_data_frame(g_rtwt, 'vertices') %>%
dplyr::arrange(desc(PageRank))
edges_df <- igraph::as_data_frame(g_rtwt, 'edges')
total_tweets <- tweets %>%
dplyr::select(user_key, text) %>%
dplyr::count(user_key) %>%
dplyr::rename(TotalTweets = n)
vertex_df <- dplyr::left_join(vertex_df, total_tweets, by = c('name' = 'user_key'))
vertex_df %>%
head() %>%
knitr::kable()
| name | PageRank | TotalTweets |
|---|---|---|
| rt_com | 0.0050983 | NA |
| blacktolive | 0.0039910 | 238 |
| gloed_up | 0.0038694 | 327 |
| chiefplan1 | 0.0038029 | NA |
| rt_america | 0.0035941 | NA |
| ten_gop | 0.0029589 | 3194 |
If the TotalTweets of a node is NA, then the account is not in the list of trolls. This means the Twitter trolls are retweeting tweets from real accounts. Let’s recreate the network and use the TotalTweets vertex attribute as something to filter out. We can remove vertices from a graph with the - operator.
g_rtwt <- igraph::graph_from_data_frame(edges_df, T, vertex_df) %>%
{. - igraph::V(.)[is.na(TotalTweets)]} %>%
{. - igraph::V(.)[igraph::degree(.) == 0]} #remove unconnected nodes
summary(g_rtwt)
## IGRAPH 5dee1a6 DN-- 83 153 --
## + attr: name (v/c), PageRank (v/n), TotalTweets (v/n), n (e/n)
Let’s re-plot the graph.
set.seed(4321)
g_rtwt %>%
plot(
vertex.size = igraph::V(.)$PageRank/max(igraph::V(.)$PageRank) * 5 + 2,
vertex.label = '',
edge.arrow.size = .05,
edge.width = .25,
asp = 0 #aspect ratio
)
igraph has a number of community detection algorithms, including igraph::infomap.community, igraph::spinglass.community, and igraph::fastgreedy.community. Here, we will use igraph::walktrap.community.
g_community <- igraph::walktrap.community(graph = g_rtwt)
names(g_community)
## [1] "merges" "modularity" "membership" "names" "vcount"
## [6] "algorithm"
The community membership is listed in the same order as the vertices, which means we can store the membership as a vertex attribute. The communities are represented as numbers; this particular graph has max(g_community$membership) communities. We can create a color palette for these communities.
igraph::V(g_rtwt)$community <- g_community$membership
community_pal <- scales::brewer_pal('qual')(max(igraph::V(g_rtwt)$community))
names(community_pal) <- 1:max(igraph::V(g_rtwt)$community)
community_pal
## 1 2 3 4 5
## "#7FC97F" "#BEAED4" "#FDC086" "#FFFF99" "#386CB0"
color is a special vertex attribute: if it exists, each vertex is automatically drawn with its stored color when plotted. Let’s iterate over the vertices and assign each a color according to its community.
igraph::V(g_rtwt)$color <- purrr::map_chr(igraph::V(g_rtwt)$community, function(x){
community_pal[[x]]
})
set.seed(4321)
g_rtwt %>%
plot(
vertex.size = igraph::V(.)$PageRank/max(igraph::V(.)$PageRank) * 5 + 2,
vertex.label = '',
edge.arrow.size = .05,
edge.width = .25,
asp = 0 #aspect ratio
)
Let’s take a moment to actually analyze the hashtags associated with these users. We can go back to our original dataset and try to match users with hashtags.
tweet_hashtag <- tweets %>%
dplyr::select(user_key, hashtags, text) %>%
  dplyr::distinct() %>%
dplyr::filter(hashtags != '[]') %>% #this represents a tweet with no hashtags
dplyr::mutate(hashtags = purrr::map(hashtags, jsonlite::fromJSON)) %>% #the stored info is a json file
tidyr::unnest() %>%
dplyr::select(user_key, hashtags)
Join the hashtag data with the vertex data, then aggregate.
vertex_df <- igraph::as_data_frame(g_rtwt, 'vertices')
edge_df <- igraph::as_data_frame(g_rtwt, 'edges')
community_hashtags <- vertex_df %>%
dplyr::left_join(tweet_hashtag, by = c('name' = 'user_key')) %>%
dplyr::count(community, hashtags, sort = T) %>%
dplyr::group_by(community) %>%
dplyr::top_n(6, wt = n) %>%
dplyr::ungroup()
community_hashtags %>%
head() %>%
knitr::kable()
| community | hashtags | n |
|---|---|---|
| 4 | maga | 1598 |
| 4 | Trump | 1038 |
| 4 | tcot | 877 |
| 2 | news | 854 |
| 4 | NeverHillary | 747 |
| 3 | RejectedDebateTopics | 570 |
community_hashtags %>%
dplyr::mutate(hashtags = purrr::map2_chr(hashtags, community, ~paste0(paste0(rep(' ', as.numeric(.y)), collapse = ''), .x))) %>%
dplyr::arrange(desc(n)) %>%
dplyr::mutate(hashtags = factor(hashtags, unique(hashtags))) %>%
ggplot2::ggplot(ggplot2::aes(x = hashtags, y = n, fill = as.character(community))) +
ggplot2::geom_col(color = 'black') +
ggplot2::facet_wrap(~community, scales = 'free') +
ggplot2::scale_fill_manual(values = community_pal) +
ggplot2::coord_flip()+
ggplot2::theme_bw() +
ggplot2::theme(legend.position="none") +
ggplot2::labs(
x = ''
)
We won’t conduct a full text analysis of these tweets, but it is worth mentioning that in topic modelling, we are often tasked with the tokenization of text - that is, we need to split the text into single words. We are also tasked with the removal of stop words or junk words that only add noise to the model. The interesting thing about the analysis we have is that hashtags serve as a type of tokenized text. It also doesn’t need to be cleaned because all hashtags, by nature of them being explicitly created, are important.
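For comparison, here is what tokenization and stop-word removal look like on ordinary tweet text. This is a minimal sketch on a made-up two-row data frame, not the troll data set, using the tidytext helpers unnest_tokens and get_stopwords:

```r
library(magrittr)

# A made-up two-row data frame standing in for real tweet text.
toy_tweets <- tibble::tibble(
  id   = 1:2,
  text = c('Make America great', 'great again and again')
)

tokens <- toy_tweets %>%
  tidytext::unnest_tokens(word, text) %>%                   #one row per lower-cased word
  dplyr::anti_join(tidytext::get_stopwords(), by = 'word')  #drop stop words such as "and"

tokens$word
```

Hashtags skip both of these steps: each hashtag is already a single, deliberately chosen token.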
We will proceed to create a topic model with these hashtags. If you want to dig deeper into the topic modelling world, then I highly recommend reading Silge and Robinson’s Tidy Text Mining in R.
Let’s revisit the user -> hashtag edge list we created earlier and select only the edges that belong to the largest component we just explored.
tweet_hashtag_edges <- tweet_hashtag_edges %>%
dplyr::filter(hashtags %in% igraph::V(largestComponent)$name) %>%
dplyr::select(-type)
With this data frame we can create something called a document term matrix. This is a matrix where the documents are the rows, the terms are the columns, and their co-occurrence counts are stored at their intersection. Let’s create one:
tweets_sparse_hash <- tweet_hashtag_edges %>%
tidytext::cast_sparse(user_key, hashtags, weight)
tweets_sparse_hash[1:10, 1:5]
## 10 x 5 sparse Matrix of class "dgCMatrix"
## #politics #maga #news #makemehateyouinonephrase #trump
## @newspeakdaily 735 . . . .
## @ameliebaldwin 2 597 6 . 322
## @kansasdailynews 326 . 573 . .
## @onlinecleveland 501 . . . .
## @washingtonline 393 . 8 . .
## @todaybostonma 377 . 2 . .
## @hyddrox 1 363 6 . 280
## @giselleevns . . . 334 6
## @batonrougevoice 315 . 9 . .
## @specialaffair . . 190 . .
The beautiful thing about this kind of matrix is that a number of topic modelling functions can work with it. Latent Dirichlet Allocation (LDA) is one that is frequently used. However, we will work with Structural Topic Models (STM). If you want to learn more about STM, you can check out the package authors’ site, which contains a number of great references. Let’s use STM to identify 8 topics in our text. I chose 8 to match the
## The model takes a minute to run,
## so I pre-ran it for you.
# set.seed(4321)
# topic_model_hash <- stm::stm(tweets_sparse_hash, K = 8,
# verbose = FALSE, init.type = "Spectral")
# readr::write_rds(topic_model_hash, 'twitter_troll_topic_model_8.rds')
topic_model_hash <- readr::read_rds("twitter_troll_topic_model_8.rds")
summary(topic_model_hash)
## Length Class Mode
## mu 2 -none- list
## sigma 49 -none- numeric
## beta 1 -none- list
## settings 13 -none- list
## vocab 846 -none- character
## convergence 4 -none- list
## theta 1456 -none- numeric
## eta 1274 -none- numeric
## invsigma 49 -none- numeric
## time 1 -none- numeric
## version 1 -none- character
We can grab the beta - the probability a term (here a hashtag) belongs to a topic.
td_beta_hash <- tidytext::tidy(topic_model_hash)
td_beta_hash %>%
head(8) %>%
knitr::kable()
| topic | term | beta |
|---|---|---|
| 1 | #politics | 0.9028386 |
| 2 | #politics | 0.0000000 |
| 3 | #politics | 0.0000000 |
| 4 | #politics | 0.0000000 |
| 5 | #politics | 0.0000000 |
| 6 | #politics | 0.0000000 |
| 7 | #politics | 0.0000000 |
| 8 | #politics | 0.0000000 |
Here we see that #politics has a strong probability of belonging to topic 1. It’s important to note that a term can belong to many topics; the beta simply tells us how likely we are to see a particular word in a particular topic. Let’s explore the words most closely related to each topic.
td_beta_hash %>%
dplyr::group_by(topic) %>%
dplyr::top_n(5, beta) %>%
dplyr::ungroup() %>%
dplyr::arrange(dplyr::desc(beta)) %>%
dplyr::mutate(term = purrr::map2_chr(term, topic, ~paste0(paste0(rep(' ', as.numeric(.y)), collapse = ''), .x))) %>%
dplyr::mutate(term = factor(term, unique(term))) %>%
dplyr::mutate(topic = paste0("Topic ", topic)) %>%
ggplot2::ggplot(ggplot2::aes(term, beta, fill = as.factor(topic))) +
ggplot2::geom_col(alpha = 0.8, show.legend = FALSE, color = 'black') +
ggplot2::facet_wrap(~ topic, scales = "free") +
ggplot2::coord_flip() +
ggplot2::labs(x = NULL, y = expression(beta),
title = "Grouping of Hashtags: Highest word probabilities for each topic",
subtitle = "Different words are associated with different topics") +
ggplot2::scale_fill_brewer(type = 'qual') +
ggplot2::theme_bw()
Now, as we did with the community detection algorithm earlier, we can use these topics to mark or color our graph. Again, while it is possible for a word to be strongly related to multiple topics, we need to choose one topic to color each hashtag. To do this, we will choose the topic associated with the word’s highest beta.
markedTopic <- td_beta_hash %>%
dplyr::group_by(term) %>%
dplyr::top_n(1, wt = beta) %>%
dplyr::select(term, topic, beta)
markedTopic %>%
head() %>%
knitr::kable()
| term | topic | beta |
|---|---|---|
| #politics | 1 | 0.9028386 |
| #maga | 7 | 0.1008277 |
| #news | 4 | 0.3310762 |
| #makemehateyouinonephrase | 5 | 0.0544027 |
| #trump | 7 | 0.0885790 |
| #neverhillary | 6 | 0.0458019 |
topicLargestComponent <- largestComponent %>%
igraph::as_data_frame('both') %>%
{
.$vertices <- dplyr::left_join(.$vertices, markedTopic, by = c('name' = 'term')) %>%
dplyr::mutate(color = purrr::map_chr(topic, ~hashtag_pal[.x]));
igraph::graph_from_data_frame(.$edges, F, .$vertices)
}
plot(topicLargestComponent, vertex.label = '', asp = 0, vertex.size = 3, edge.width = .1)
The really cool thing about working with Twitter data sets is that you can explore a lot of different connections: you can follow a retweet network, see how hashtags relate to each other, and even explore how different people follow one another. I’m not sure analyzing this particular data set taught us anything new about the Russian Twitter Trolls, but it did give us an opportunity to see how to graph networks in R and how the igraph package functions within the greater R ecosystem. I hope this tutorial helped you learn something.
Cheers,
Ben